A Broad-Coverage Normalization System for Social Media Language
Abstract
Social media language contains a huge amount and wide variety of nonstandard tokens, created both intentionally and unintentionally by users. It is crucially important to normalize these noisy nonstandard tokens before applying other NLP techniques. A major challenge for this task is system coverage, i.e., for any user-created nonstandard term, the system should be able to restore the correct word within its top n output candidates. In this paper, we propose a cognitively-driven normalization system that integrates different human perspectives in normalizing nonstandard tokens, including enhanced letter transformation, visual priming, and string/phonetic similarity. The system was evaluated at both the word and message level using four SMS and Twitter data sets. Results show that our system achieves over 90% word-coverage across all data sets (a 10% absolute increase compared to the state of the art); this broad word-coverage also translates into a message-level performance gain, yielding a 6% absolute increase compared to the best prior approach.
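The coverage criterion in the abstract (the correct word appears within the top n output candidates) can be illustrated with a minimal sketch. This is not the paper's evaluation code: the function name `top_n_coverage`, the dictionary layout, and the toy tokens below are all assumptions made for illustration, and the numbers are made up rather than taken from the paper's data sets.

```python
from typing import Dict, List


def top_n_coverage(
    candidates: Dict[str, List[str]],
    gold: Dict[str, str],
    n: int,
) -> float:
    """Fraction of nonstandard tokens whose gold word is in the top-n candidates.

    `candidates` maps each nonstandard token to a ranked candidate list
    (best first); `gold` maps each token to its annotated standard word.
    Both structures are hypothetical, chosen only for this sketch.
    """
    if not gold:
        return 0.0
    hits = sum(
        1
        for token, word in gold.items()
        if word in candidates.get(token, [])[:n]
    )
    return hits / len(gold)


# Toy, made-up data: ranked candidates a normalizer might return.
candidates = {
    "2moro": ["tomorrow", "tomorrows"],
    "b4": ["be", "before", "bar"],
    "gr8": ["grate", "gray", "great"],
    "luv": ["live", "lava"],
}
gold = {"2moro": "tomorrow", "b4": "before", "gr8": "great", "luv": "love"}

coverage_at_1 = top_n_coverage(candidates, gold, 1)
coverage_at_3 = top_n_coverage(candidates, gold, 3)
print(coverage_at_1, coverage_at_3)
```

In this toy setup coverage rises as n grows, because a correct word ranked second or third counts as covered. A token whose candidate list never contains the gold word (here `luv`) stays uncovered at every n, which is the failure case that broad coverage aims to reduce.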
Similar resources
A Beam-Search Decoder for Normalization of Social Media Text with Application to Machine Translation
Social media texts are written in an informal style, which hinders other natural language processing (NLP) applications such as machine translation. Text normalization is thus important for processing of social media text. Previous work mostly focused on normalizing words by replacing an informal word with its formal form. In this paper, to further improve other downstream NLP applications, we ...
Context Tailoring for Text Normalization
Language processing tools suffer from significant performance drops in social media domain due to its continuously evolving language. Transforming non-standard words into their standard forms has been studied as a step towards proper processing of ill-formed texts. This work describes a normalization system that considers contextual and lexical similarities between standard and non-standard wor...
A Cascaded Approach for Social Media Text Normalization of Turkish
Text normalization is an indispensable stage for natural language processing of social media data with available NLP tools. We divide the normalization problem into 7 categories, namely; letter case transformation, replacement rules & lexicon lookup, proper noun detection, deasciification, vowel restoration, accent normalization and spelling correction. We propose a cascaded approach where each...
Social Text Normalization using Contextual Graph Random Walks
We introduce a social media text normalization system that can be deployed as a preprocessing step for Machine Translation and various NLP applications to handle social media text. The proposed system is based on unsupervised learning of the normalization equivalences from unlabeled text. The proposed approach uses Random Walks on a contextual similarity bipartite graph constructed from n-gram ...
Text Normalization in Social Media: Progress, Problems and Applications for a Pre-processing System of Casual English
The rapid expansion in user-generated content on the Web of the 2000s, characterized by social media, has led to Web content featuring somewhat less standardized language than the Web of the 1990s. User creativity and individuality of language creates problems on two levels. The first is that social media text is often unsuitable as data for Natural Language Processing tasks such as Machine Tra...